Abstract — In this paper, we consider the problem of energy
efficient uplink scheduling with delay constraint for a multi-user
wireless system. We address this problem within the framework 
of constrained Markov decision processes (CMDPs) wherein one 
seeks to minimize one cost (average power) subject to a hard 
constraint on another (average delay). We do not assume the 
arrival and channel statistics to be known. To handle state space 
explosion and informational constraints, we split the problem into 
individual CMDPs for the users, coupled through their Lagrange 
multipliers; and a user selection problem at the base station. To 
address the issue of unknown channel and arrival statistics, we 
propose a reinforcement learning algorithm. The users use this 
learning algorithm to determine the rate at which they wish to 
transmit in a slot and communicate this to the base station. The 
base station then schedules the user with the highest rate in a slot. 
We analyze convergence, stability and optimality properties of 
the algorithm. We also demonstrate the efficacy of the algorithm 
through simulations within an IEEE 802.16 system.

Index Terms — multi-user Fading Channel, Constrained 
Markov Decision Process, Energy Efficient Scheduling, Learning 
Algorithm 



I. Introduction 

Broadband wireless networks like IEEE 802.16 [1] and 3G 
cellular [2] are expected to provide Quality of Service (QoS) 
for emerging multimedia applications. One of the challenges 
in providing QoS is the time varying nature of the wireless 
channel due to multipath fading [3]. Moreover, for portable 
and hand-held devices, energy efficiency is also an important 
consideration. 

For most wireless communication systems, the power re- 
quired to transmit 'reliably' for a given channel fading state 
is an increasing and strictly convex function of the trans- 
mission rate [3]. This suggests that energy efficiency can 
be achieved by transmitting the data at lower rates when 
the channel is bad, albeit at a cost of queuing delay, thus 
leading to a power-delay tradeoff. Furthermore, in a multi-user
wireless system, recent studies [4], [5] suggest that since the 
wireless channel fades independently across different users, 
this diversity can be exploited by opportunistically scheduling 
the user with the best channel gain. This leads to significant
performance improvement in terms of total system throughput. 
Such scheduling algorithms that exploit the characteristics 
of the physical channel to satisfy some network level QoS 
performance metrics are referred to as cross layer scheduling 
algorithms [6]. 

In this paper, we consider a single cell multi-user wireless 
uplink system. For such a system, we consider the problem of 
determining the user to be scheduled in each time slot along 
with its transmission rate. This scenario may correspond to 
scheduling users on the uplink in an IEEE 802.16 based system 
to satisfy delay constraint of each user. 

A. Related Work 

1) Energy Efficient Scheduling: The problem of energy 
efficient scheduling with delay constraint for a single user 
wireless channel has been explored in the pioneering work of 
[7]. Subsequently, the model of Berry-Gallager [7] has been 
extended with many generalizations on arrival and channel 
state processes in [8], [9], [10], [11], [12], [13], [14]. In most 
of these papers, the scheduling policy has been formulated 
as a control policy within the Constrained Markov Decision 
Process (CMDP) framework. However, only structural results 
of the optimal policy are available under various assumptions. 
Moreover, these results are applicable to only the single user 
scenario. There is very little work for extending the vast body 
of literature on single user delay constrained energy efficient 
scheduling to the multi-user scenario. 

In [15], the author does extend the analysis for the single user
case to the multi-user case, albeit with only two users. Beyond two
users, the problem becomes too unwieldy to gain any useful 
insight. This is primarily due to the large state space. For the 
two user case, the author has given an elegant near optimal 
policy where each user's rate allocation is determined by the 
joint channel states across users and the user's own queue 
state. Thus each user's queue evolution process behaves as if it 
were controlled by a single user policy. However, computation 
of user's transmission power still takes into account the joint 
channel and queue state processes. 

Recently, in a significant work [16], the author has extended 
the asymptotic analysis of Berry-Gallager [7] for exploiting 
the power-delay tradeoff to a multi-user system. The author, 
however, has considered the case of downlink scheduling, 
i.e., the base station scheduling users on the downlink. The
objective is to minimize the total sum power subject to 
the users' queue stability constraints. Using the concept of 
Lyapunov Drift Steering, the author has given an algorithm 
that comes within a logarithmic factor of achieving the Berry-Gallager
power-delay bound. While [16] is one of the first
serious attempts at multi-user energy efficient scheduling with 
delay constraint, it deals with sum power minimization on 
the downlink. On the other hand, for the uplink, the problem
is to minimize the average power of each user subject to 
individual delay constraint, subject to the additional constraints 
automatically imposed by the multi-user environment. This has 
not been addressed in the literature so far. 

Even for the single user case [8], [9], [10], [11], [12], 
[13], [14], practical implementation of the optimal policy is far
from simple. This is because a knowledge of the probability
distributions of the arrival and channel state processes is
required for computing the optimal policy. This knowledge is
usually not available in practice. We have addressed this lim- 
itation by formulating an on-line algorithm within stochastic 
approximation framework in [17]. 

2) Other Multi-user Cross Layer Scheduling Schemes:
While there has been little work in the area of multi-user 
energy efficient delay constrained scheduling, there is an abun- 
dance of literature on cross layer scheduling algorithms with 
other objectives. See [18] for a succinct review. A scheduling 
policy is considered stable if the expected queue lengths are
bounded under the policy. Many scheduling policies proposed
in the literature have considered stability as a QoS criterion. In
[19], the authors have shown that the throughput capacity region
(as derived in [4]) is the same as the multi-access stability
region (i.e., the set of all arrival vectors for which there exists
some rate and power allocation policy that keeps the system
stable). A scheduler is termed throughput optimal if it can
maintain the stability of the system as long as the arrival rate
is within the stability region. Throughput optimal scheduling
policies have been explored in [4], [20]. Longest Connected
Queue (LCQ) [21], Exponential (EXP) [22], Longest Weighted 
Queue Highest Possible Rate (LWQHPR) [23] and Modified 
Longest Weighted Delay First (M-LWDF) [24] are other well 
known throughput optimal scheduling policies. 

While throughput optimal scheduling policies maintain the 
stability of the queueing system, they do not necessarily 
guarantee small queue length and consequently lower delay. 
Delay optimal scheduling deals with optimal rate and power 
allocation such that the average queue length and hence 
average delay are minimized for arrival rates within the 
stability region under average and peak power constraints. It 
has been shown that the Longest Queue Highest Possible Rate 
(LQHPR) policy [25] (besides being throughput optimal) is 
also delay optimal for any symmetric power control under 
symmetric fading provided that the packet arrival process is 
Poisson and the packet length is exponentially distributed. 

Apart from throughput and delay optimal policies, opportunistic
scheduling which maximizes sum throughput subject
to various fairness constraints has been explored in [26], [27].

B. Our Contributions 

In this paper, we consider the problem of opportunistic 
scheduling for a multi-user uplink system with power cost and 
individual delay constraints. Considered as a centralized control
problem of power minimization subject to constraints on
average delays, this would be a special case of a constrained
Markov decision process (CMDP). The traditional approaches 
for numerically determining the optimal policy in a CMDP 
framework are based on Linear Programming (LP) [28]. These, 
however, cannot be used for the problem posed in this paper
for the following reasons:

1) Large state space: In our model, the system state space
is large even for a moderate number of users, and the state
space size increases exponentially with the number of
users. We illustrate this with a simple example. Consider
a system with 4 users. Assume that each user has a buffer
of size 50 packets (assuming equal sized packets). The
channel condition of each user can be represented using
8 states, which is a practical assumption justified in [29].
For this scenario, the system state space contains
$50^4 \times 8^4 = 2.56 \times 10^{10}$ states. The computational complexity
for determining the optimal policy (possibly based on 
the CMDP approach) is proportional to the state space 
size [30], [31] and thus increases exponentially with the 
number of users. 

2) Unknown system model: Computation of the optimal
scheduling policy using traditional schemes based on
LP assumes knowledge of the system model, i.e.,
knowledge of the probability distributions of the arrival
and channel state processes, for modeling the transition 
probability mechanism of the underlying Markov chain. 
This knowledge is not easily available in practice, so the 
exact model is not known. 

3) Communication overheads: In the multi-user framework
considered here, there is also a cost on messages communicated
between users and base station, as these
consume some of the available rate. Thus any proposed
scheme should be low on these overheads, which works
against any centralized scheme based on full state information.
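The state-space arithmetic in item 1) can be checked with a short Python sketch. It follows the paper's count of 50 queue levels and 8 channel states per user, and contrasts the joint state count with the total size of N separate per-user state spaces. The function names are illustrative and do not come from the paper.

```python
def joint_state_count(num_users: int, queue_levels: int, channel_states: int) -> int:
    """Joint (queue, channel) states when all users are modeled together."""
    return (queue_levels * channel_states) ** num_users


def per_user_state_count(num_users: int, queue_levels: int, channel_states: int) -> int:
    """Total states when each user keeps its own (queue, channel) state space."""
    return num_users * queue_levels * channel_states


joint = joint_state_count(4, 50, 8)         # 25_600_000_000, i.e. 2.56e10
separate = per_user_state_count(4, 50, 8)   # 1_600
```

The first count grows exponentially in the number of users, the second only linearly, which is the gap the decomposition described later is meant to exploit.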

The issue of unknown system model can be resolved by using 
reinforcement learning (RL) algorithms [31] which 'learn' 
the optimal policy by performing approximate dynamic 
programming based on observed data. However, with such 
a large state space, the learning algorithms would take a
prohibitively long time to converge to the optimal scheduling
policy. One therefore has to address the issue of the large
state space first and then employ the reinforcement learning
algorithms appropriately. This provides us the motivation for
designing a multi-user scheduling policy as a combination of
single user policies that search over a relatively small state
space. This is achieved by artificially splitting the problem
into several single user problems for the individual users 
and the base station, which are coupled, but in a relatively 
simple and manageable manner. This reduces the complexity 
to linear in the number of users. 

Thus in our approach, each user behaves as though it is 
facing a single user optimization problem and comes up with 
a desired rate. This is then communicated to the base station. 
The base station then schedules the user with the highest rate 
requirement in a slot. The intuition behind this is that this 
will favor the user with greatest need, be it because of a
favorable channel or a high queue length. At the same time, 
when a user with lower rate requirement is not allocated the 
channel for a while, its queue length and therefore the rate 
requirement will go up and it will eventually be allocated the 
channel. The learning algorithm puts a penalty on violation 
of the delay constraint. This implicitly couples individual 
decisions, as the users are sharing a common channel. 

The scheduling algorithms proposed in the literature like 
EXP [22], LQHPR [25], M-LWDF [24] etc., also require 
the queue length information for determining the scheduling 
decision. In the downlink scenario, this information is 
readily available to the scheduler residing at the base station. 
However, in the uplink scenario, this information needs to be 
communicated by the users to the scheduler. Communicating 
the queue length information poses a significant overhead. 
In our approach, each user needs to communicate only the 
desired rate. In a practical system, we may have only a few possible
rates, say eight. This means that we may need only 3 bits of 
information to be conveyed. 
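The feedback cost mentioned above can be made concrete with a small Python sketch, assuming the user reports only the index of its desired rate within a fixed, known rate table. The helper name and the rate table are hypothetical.

```python
def feedback_bits(num_rates: int) -> int:
    """Bits needed to index one of num_rates possible rates."""
    if num_rates < 1:
        raise ValueError("need at least one rate")
    return (num_rates - 1).bit_length()


rate_table = [0, 1, 2, 3, 4, 5, 6, 7]   # eight possible rates (illustrative)
bits = feedback_bits(len(rate_table))   # 3
```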

Through our simulations in an IEEE 802.16 system, we 
demonstrate that the algorithm is indeed able to satisfy the 
delay constraints of the users. Moreover, we demonstrate that 
the power expenditure of a user is commensurate with its 
delay requirement, average arrival rate and average channel 
condition. 

The contributions of this paper can be summarized as 
follows: 

1) We propose a novel scheduling algorithm for minimizing 
the average power expended by each user subject to 
a constraint on individual user delay in a multi-user 
uplink wireless system. This algorithm does not require 
knowledge of the probability distributions of the channel 
states and the arrivals. 

2) We analyze convergence, stability and optimality prop- 
erties of the proposed scheme. In particular, we show 
certain desirable properties such as Pareto optimality and 
an interpretation as 'Markov equilibrium' of a stochastic 
game. We also argue incentive compatibility of the 
scheme. 

3) We demonstrate applicability of the algorithm within
the IEEE 802.16 framework. Our simulation studies involving
comparison with the M-LWDF scheduler demonstrate
that the proposed algorithm is power efficient. We also
study performance of the scheme under different 'information
accuracies', i.e., with different numbers of bits
for conveying the desired rate.

The rest of the paper is organized as follows. In Section II, 
we present the system model. In Section III, we formulate the 
multi-user scheduling problem. In Section IV, we propose an 
on-line learning algorithm for the users and a user selection 
algorithm at the base station. We also discuss the imple- 
mentation issues. In Section V we analyze properties of the 
algorithm such as convergence, queue stability and optimality.
We present the simulation setup and results in Section VI.
Finally, we conclude in Section VII.

Fig. 1. System Model

II. System Model 

We consider uplink transmissions (as in Figure 1) in a Time 
Division Multiple Access (TDMA) system¹ with N users, i.e.,
time is divided into slots of equal duration and only one user is 
allowed to transmit in a slot. We assume that the slot duration 
is normalized to unity. The base station is a centralized entity 
that schedules the users in every slot. We assume a fading 
wireless channel where the channel gain is assumed to remain 
constant for the duration of the slot and to change in an 
independent and identically distributed (i.i.d.) manner across 
slots. This model is termed the block fading model [7]. We
assume that the channel gains across users are also i.i.d. Under 
these assumptions, if user $i$ transmits a signal $y_n^i$ in slot $n$,
then the received signal $Y_n^i$ can be expressed as

$$Y_n^i = H_n^i y_n^i + G_n^i, \tag{1}$$

where $H_n^i$ denotes the complex channel gain due to fading
and $G_n^i$ denotes the complex additive white Gaussian noise
with zero mean and variance $N_0$. Let $X_n^i = |H_n^i|^2$ be the
channel state for user $i$ in slot $n$. Usually, $H_n^i$ (and hence $X_n^i$)
is a continuous random variable. However, in this paper we
assume that $X_n^i$ takes only finite and discrete values from a
set $\mathcal{X}$. This assumption has been justified in [7], [8]. In this
paper, we assume that the distribution of $X_n^i$ is unknown.

We assume that the users' packets are of equal size, say, 
T bits. Packets arrive into the user buffer of infinite capacity 
and are queued until they are transmitted. The packet arrival 
process for each user is assumed to be i.i.d. across slots. Let
$A_n^i$ denote the number of packets arriving into the user $i$ buffer
in slot $n$. We assume that the random variable $A_n^i$ takes values
from a finite and discrete set $\mathcal{A} = \{0, \ldots, A\}$. Like $X_n^i$, we
assume that the distribution of $A_n^i$ is unknown.

Let $Q_n^i \in \mathcal{Q} = \{0, 1, \ldots\}$ denote the queue length or buffer
occupancy of user $i$ in slot $n$. Let $U_n^i$ denote the number of
packets actually transmitted in slot $n$ (by user $i$). We assume
that $U_n^i$ takes values from the set $\mathcal{U} = \{0, 1, \ldots\}$. Let $I_n^i$ be an
indicator variable that is set to 1 if user $i$ is scheduled in slot
$n$ and is set to 0 otherwise. Let $\mathbf{I}_n$ be the vector $[I_n^1, \ldots, I_n^N]$.
Note that since only one user can transmit in a slot, only one
element of $\mathbf{I}_n$ is equal to 1 and the rest are 0. Let $\mathcal{I}$ be the set
of all possible $N$ dimensional vectors with one element equal
to 1 and the rest being 0. Let $R_n^i \in \mathcal{U}$ denote the number of
packets that user $i$ should transmit in a slot if it is scheduled.
Then $U_n^i$ can be represented as $U_n^i = I_n^i R_n^i$. Moreover, since a
user can at most transmit all the packets in its buffer in a slot,
$R_n^i \le Q_n^i$. Since we assume that the slot length is normalized
to unity, $U_n^i$ also represents the rate at which user $i$ transmits
in slot $n$. Let $\mathbf{U}_n$ be the vector $[U_n^1, \ldots, U_n^N]$, $\mathbf{U}_n \in \mathcal{U}^N$.

¹ The assumption of TDMA does not restrict the applicability of this
formulation to any orthogonal multiple access system.

The queue evolution equation for user $i$ can be expressed
as

$$Q_{n+1}^i = Q_n^i - U_n^i + A_{n+1}^i. \tag{2}$$

For most communication systems, the power required for
reliable communication at a rate $R_n^i = u$ packets/sec when
$X_n^i = x$, denoted as $P(x, u)$, is an increasing and strictly
convex function of $u$. Let $P$ denote the maximum power with
which a user can transmit in any slot. Let $W(x^i)$ be the
maximum number of packets which user $i$ can transmit in
a slot when the channel state is $x^i$ when transmitting with
power $P$. Then the set of feasible actions for user $i$ in slot $n$ is
$\mathcal{F}_n^i = \{0, \ldots, \min(W(X_n^i), Q_n^i)\}$.
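The queue dynamics and the feasible action set described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `max_packets_at_peak_power` stands in for the peak-power limit $W(x)$, and all function and parameter names are hypothetical.

```python
def feasible_rates(queue_len: int, max_packets_at_peak_power: int) -> range:
    """Rates 0..min(W(x), Q): limited by peak power and by buffer content."""
    return range(0, min(max_packets_at_peak_power, queue_len) + 1)


def queue_step(queue_len: int, desired_rate: int, scheduled: bool,
               arrivals_next: int) -> int:
    """One slot of the queue evolution: Q' = Q - I * R + A'."""
    if not 0 <= desired_rate <= queue_len:
        raise ValueError("a user cannot send more packets than it holds")
    transmitted = desired_rate if scheduled else 0   # U = I * R
    return queue_len - transmitted + arrivals_next


q_if_scheduled = queue_step(5, 3, True, 2)     # 5 - 3 + 2 = 4
q_if_not_scheduled = queue_step(5, 3, False, 2)  # 5 - 0 + 2 = 7
```

A user that is not scheduled keeps its packets, so its queue grows by the new arrivals only.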

We assume that the users specify their QoS requirements in 
terms of the average packet delay requirements. These delay 
requirements of the users are known a priori to the scheduler.
By Little's law [32], the average delay $\bar{D}$ is related to the
average queue length $\bar{Q}$ as

$$\bar{Q} = a \bar{D}, \tag{3}$$

where $a$ is the average arrival rate. In the rest of the paper, we
treat average delay as synonymous with average queue length
and ignore the proportionality constant $a$.
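As a small numerical illustration of Little's law, the Python sketch below converts a delay target into the equivalent queue-length target. The numbers (0.5 packets per slot, 10 slots) are made up for the example and the function name is hypothetical.

```python
def queue_target(arrival_rate: float, delay_target: float) -> float:
    """Average queue length that corresponds to a given average delay."""
    return arrival_rate * delay_target


# A user with 0.5 packets/slot on average and a 10-slot delay target
# must keep its average queue length at or below 5 packets.
target = queue_target(0.5, 10.0)
```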

III. Problem Formulation 

In this section, we formulate the multi-user scheduling 
problem. 

A. Formulation as a Constrained Optimization Problem 

The problem considered here is to design a scheme for 
scheduling a user in each time slot and also the rate (i.e., 
number of packets to be transmitted) that minimizes the average
power expenditure of each user subject to the satisfaction
of individual delay constraint. The average power consumed
by a user $i$ can be expressed as:

$$\bar{P}^i = \limsup_{M \to \infty} \frac{1}{M} E \sum_{n=1}^{M} P(X_n^i, U_n^i). \tag{4}$$

The average queue length of user $i$ can be expressed as:

$$\bar{Q}^i = \limsup_{M \to \infty} \frac{1}{M} E \sum_{n=1}^{M} Q_n^i. \tag{5}$$

Each user $i$ wants its average queue length to remain below a
certain value, say, $\delta^i$. Our objective is to design a scheduling
algorithm that minimizes $\bar{P}^i$ for each user $i$ subject to a
constraint on $\bar{Q}^i$. Thus the scheduler objectives can be stated
as

$$\text{Minimize } \bar{P}^i \text{ subject to } \bar{Q}^i \le \delta^i, \quad i = 1, \ldots, N. \tag{6}$$

Remark 1: Note that there are actually N problems in (6). 
However, these problems are not independent. This is because 
in a TDMA system, only one user can be scheduled in a slot. 
Consequently, the scheduling decision in a slot impacts the 
buffer occupancy of all the users in future slots. 

B. Notion of an Optimal Solution 

The problem in (6) is an optimization problem with N 
objectives and $N$ constraints. There can be multiple average
power vectors that can be considered as optimal. Let
$[\bar{P}_\Phi^1, \ldots, \bar{P}_\Phi^N]$ denote the average power expended under the
scheduling policy $\Phi$. We say that the scheduling policy $\Phi$ is
Pareto optimal if and only if there exists no policy $\Psi$ with the
corresponding power expenditure vector $[\bar{P}_\Psi^1, \ldots, \bar{P}_\Psi^N]$ having
the following properties:

$$\forall i \in \{1, \ldots, N\}:\ \bar{P}_\Psi^i \le \bar{P}_\Phi^i \ \wedge\ \exists i \in \{1, \ldots, N\}:\ \bar{P}_\Psi^i < \bar{P}_\Phi^i. \tag{7}$$

We seek a Pareto optimal solution in this paper. 
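The dominance test behind this definition can be sketched in Python. The example operates on a finite list of candidate average-power vectors, which is an assumption made for illustration (the paper's definition ranges over all policies), and the vectors themselves are invented.

```python
def dominates(p_a, p_b) -> bool:
    """True if p_a is no worse for every user and strictly better for one."""
    return (all(a <= b for a, b in zip(p_a, p_b))
            and any(a < b for a, b in zip(p_a, p_b)))


def pareto_front(vectors):
    """Vectors that are not dominated by any other vector in the list."""
    return [v for v in vectors if not any(dominates(w, v) for w in vectors)]


candidates = [(1.0, 3.0), (2.0, 2.0), (2.5, 2.5), (3.0, 1.0)]
front = pareto_front(candidates)   # (2.5, 2.5) is dominated by (2.0, 2.0)
```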

In the next section, we present a solution strategy wherein 
we split the problem into N + 1 parts: N user problems and 
one base station problem. 

IV. Solution Strategy: Decomposition into User
and Base Station Problems

In this section, we propose the following solution strategy: 
we view the problem as N (dependent) user problems and 
one base station problem. The user i problem is to determine 
a rate at which it desires to transmit in a slot so as to solve 
the problem specified in (6). Since the channel and arrival 
statistics are not known, in order to address this problem, 
the users resort to a 'learning' approach discussed below. The 
users' desired rates are then conveyed to the base station. The 
base station problem is to select a user in each slot such that 
the average power expenditure vector is a point on the Pareto 
frontier, i.e., it is Pareto optimal. Next, we present the on-line 
learning algorithm at users. 

A. Learning Algorithm for the Users 

We consider a modified version of the on-line algorithm 
proposed in [17] that determines the transmission rate in every 
slot for each user. Once the on-line algorithm has determined 
the rate $R_n^i \in \mathcal{F}_n^i$, the transmitter at a particular user executes
this action if the channel is allocated to it, otherwise it is 
unable to proceed with the transmission. If the transmitter is 
not able to proceed with the transmission, the packets remain 
in the queue. Under the given model, the queue evolution
equation can be expressed as

$$Q_{n+1}^i = Q_n^i + A_{n+1}^i - R_n^i I_n^i, \tag{8}$$

where $I_n^i = 1$ if the transmitter $i$ actually transmits the packets
(i.e., if user $i$ is scheduled in slot $n$), else $I_n^i = 0$. Let $\mathbf{Q}_n$
and $\mathbf{X}_n$ denote the vectors $[Q_n^1, \ldots, Q_n^N]$ and $[X_n^1, \ldots, X_n^N]$,
respectively. The state of the system $\mathbf{S}_n$ at time $n$ can be
described by $\mathbf{S}_n = (\mathbf{Q}_n, \mathbf{X}_n)$, comprising the queue lengths
and the channel states. The control variables are $\mathbf{I}_n$ and
$\mathbf{R}_n = [R_n^1, \ldots, R_n^N]$, of which the transmitters independently choose
the corresponding components of $\mathbf{R}_n$ and the base station,
which is doing the channel allocation, chooses $\mathbf{I}_n$ subject to
the constraints that $I_n^i \in \{0, 1\}$ and $\sum_i I_n^i = 1$.

Our learning policy, however, mandates that once transmitter 
$i$ has determined the rate $R_n^i$, it updates its power cost
irrespective of whether packets have been transmitted or not.
That is, it operates as though its cost is

$$\tilde{P}^i = \limsup_{M \to \infty} \frac{1}{M} E \sum_{n=1}^{M} P(X_n^i, R_n^i). \tag{9}$$

Thus the problem that user $i$ addresses can be specified as:

$$\text{Minimize } \tilde{P}^i \text{ subject to } \bar{Q}^i \le \delta^i, \quad i = 1, \ldots, N. \tag{10}$$

This problem is a CMDP with average cost criterion. The 
objective is to determine an optimal policy $\mu^{i,*}$ such that
the power cost (9) is minimized while satisfying the delay
constraint. Note that the policy considered here minimizes the
single user cost exactly as in the single user case of [17] as
specified in (9), not the actual power cost (4). See Remark
2 at the end of this sub-section and the subsequent sections
for further discussion on this aspect. 

1) The Primal Dual Approach: The constrained problem 
in (10) can be converted into an unconstrained problem using 
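the Lagrangian approach [28]. We focus on the $i$th user. Let
$\lambda^i \ge 0$ be a real number termed the Lagrange Multiplier
(LM). Let $c : \mathcal{R}^+ \times \mathcal{Q} \times \mathcal{X} \times \mathcal{U} \to \mathcal{R}$ be defined as
follows:

$$c(\lambda^i, Q_n^i, X_n^i, R_n^i) = P(X_n^i, R_n^i) + \lambda^i (Q_n^i - \delta^i), \tag{11}$$

where $R_n^i$ is determined using the rate allocation policy
$\mu^i : \mathcal{Q} \times \mathcal{X} \to \mathcal{U}$. The unconstrained problem is to minimize:

$$L(\mu^i, \lambda^i) = \limsup_{M \to \infty} \frac{1}{M} E \sum_{n=1}^{M} c(\lambda^i, Q_n^i, X_n^i, \mu^i(Q_n^i, X_n^i)). \tag{12}$$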

$L(\cdot, \cdot)$ is called the Lagrangian. Our objective is to determine
the optimal rate allocation policy $\mu^{i,*}$ and optimal LM $\lambda^{i,*}$
such that the following saddle point optimality condition [33]
is satisfied:

$$L(\mu^{i,*}, \lambda^i) \le L(\mu^{i,*}, \lambda^{i,*}) \le L(\mu^i, \lambda^{i,*}). \tag{13}$$
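The per-slot Lagrangian cost that combines power with a penalty on queue length can be sketched in Python. The power model below, $(2^r - 1)/x$, is an assumption chosen only because it is increasing and strictly convex in the rate; the paper does not commit to a specific $P(x, r)$ in this section. All names are illustrative.

```python
def power(x: float, r: int) -> float:
    """Assumed power model: increasing and strictly convex in the rate r."""
    return (2.0 ** r - 1.0) / x


def lagrangian_cost(lam: float, q: int, x: float, r: int, delta: float) -> float:
    """Power cost plus multiplier-weighted violation of the queue target."""
    return power(x, r) + lam * (q - delta)


# Queue of 10 packets against a target of 4, channel state 2.0, rate 3:
# power = (8 - 1) / 2 = 3.5, penalty = 0.5 * (10 - 4) = 3.0
cost = lagrangian_cost(lam=0.5, q=10, x=2.0, r=3, delta=4.0)
```

A larger multiplier makes long queues more expensive, which pushes the minimizing rate upward; a multiplier of zero reduces the cost to pure power.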



For a fixed LM $\lambda^i$, the problem is an unconstrained Markov
Decision Problem (MDP) with the average cost criterion. Let
$p((q, x), r, (q', x'))$ denote the probability of reaching a state
$(q', x')$ upon taking an action $r$ in state $(q, x)$. Let $V^{i,\mu}(\cdot, \cdot)$
denote the value function under policy $\mu$, i.e., $V^{i,\mu}(q, x)$
denotes the expected cost for a state $(q, x)$ under a policy $\mu$.
Let $V^i(\cdot, \cdot)$ denote the optimal value function. It can be expressed
as:

$$V^i(q, x) = \min_{\mu} V^{i,\mu}(q, x). \tag{14}$$



The following dynamic programming equation [30] gives a
necessary condition for the optimality of a solution:

$$V^i(q, x) = \min_{r} \Big[ c(\lambda^i, q, x, r) - \beta + \sum_{q', x'} p((q, x), r, (q', x')) V^i(q', x') \Big], \tag{15}$$

where $a' \in \mathcal{A}$, $x' \in \mathcal{X}$, $V^i(\cdot, \cdot)$ is the value function, $\beta \in \mathcal{R}$
is the unique optimal power expenditure and $(q^0, x^0) \in \mathcal{Q} \times \mathcal{X}$
is any pre-designated state. If we impose $V^i(q^0, x^0) = 0$, then
$V^i(\cdot, \cdot)$ is unique [30]. The Relative Value Iteration Algorithm
(RVIA) [30] is a known approach for determining the optimal
value function. It can be expressed as:

$$V_{n+1}^i(q, x) = \min_{r} \Big[ c(\lambda^i, q, x, r) - V_n^i(q^0, x^0) + \sum_{q', x'} p((q, x), r, (q', x')) V_n^i(q', x') \Big]. \tag{16}$$

Note, however, that RVIA (16) requires the knowledge of
$p(\cdot, \cdot, \cdot)$. This depends on the probability distributions of the
arrivals and channel states, which are not known. Moreover,
determining the optimal value function as defined in (15) is not
sufficient because the unconstrained solution for a particular
$\lambda^i$ does not ensure that the constraints would be satisfied.
To ensure constraint satisfaction, the optimal LM needs to be
determined.
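For intuition, relative value iteration with a known model can be sketched in Python on a toy single-user queue. Everything specific here is an assumption for illustration: the buffer is truncated at `Q_MAX`, the channel and arrival distributions are invented, the power model is $(2^r - 1)/x$, and the multiplier is held fixed. This is the known-model case that the learning algorithm is designed to avoid.

```python
def rvia(states, actions, cost, transitions, ref_state, num_iters=300):
    """Relative value iteration: subtract the reference-state value each sweep.

    transitions(s, r) returns a list of (probability, next_state) pairs.
    """
    V = {s: 0.0 for s in states}
    for _ in range(num_iters):
        offset = V[ref_state]
        V = {
            s: min(
                cost(s, r) - offset
                + sum(p * V[s_next] for p, s_next in transitions(s, r))
                for r in actions(s)
            )
            for s in states
        }
    return V


# Toy model (all numbers are illustrative assumptions).
Q_MAX, DELTA, LAM, MAX_RATE = 5, 2.0, 1.0, 2
CHANNELS = {0.5: 0.5, 2.0: 0.5}   # channel state -> probability (i.i.d.)
ARRIVALS = {0: 0.5, 1: 0.5}       # arrivals per slot -> probability

states = [(q, x) for q in range(Q_MAX + 1) for x in CHANNELS]


def actions(s):
    return range(min(s[0], MAX_RATE) + 1)


def cost(s, r):
    q, x = s
    return (2.0 ** r - 1.0) / x + LAM * (q - DELTA)


def transitions(s, r):
    q, _ = s
    return [(pa * px, (min(q - r + a, Q_MAX), x_next))
            for a, pa in ARRIVALS.items()
            for x_next, px in CHANNELS.items()]


V = rvia(states, actions, cost, transitions, ref_state=(0, 0.5))
average_cost_estimate = V[(0, 0.5)]   # reference value tracks the average cost
```

At a fixed point of this iteration the value at the reference state plays the role of the constant subtracted in the dynamic programming equation, so it can be read as an estimate of the optimal average cost for the chosen multiplier.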

2) The Post-decision State Formulation: To address the
difficulty posed by unknown $p(\cdot, \cdot, \cdot)$, we introduce a stochastic
approximation version of the RVIA. The RVIA above,
however, is not suited for this because the conditional expectation
w.r.t. the transition probability is inside the (nonlinear)
minimization operator. This prompts us to replace the
state variables by 'post-decision state' variables²
$(\tilde{Q}_n^i, \tilde{X}_n^i)$ defined by $\tilde{Q}_n^i = Q_n^i - U_n^i$, $\tilde{X}_n^i = X_n^i$. Let $\zeta(\cdot)$
denote the unknown law for the arrivals and $\kappa(\cdot \mid \cdot)$ the transition
probability kernel for the channel states. The dynamic
programming equation and the RVIA for post-decision states
become

$$V^i(q, x) = \sum_{a', x'} \zeta(a') \kappa(x' \mid x) \times \min_{r} \big[ c(\lambda^i, q + a', x', r) - \beta + V^i(q + a' - r, x') \big], \tag{17}$$

and

$$V_{n+1}^i(q, x) = \sum_{a', x'} \zeta(a') \kappa(x' \mid x) \times \min_{r} \big[ c(\lambda^i, q + a', x', r) - V_n^i(q^0, x^0) + V_n^i(q + a' - r, x') \big], \tag{18}$$

where $(q^0, x^0)$ is any pre-designated reference post-decision
state. On the right hand side in (18), the value function
corresponding to this state is subtracted in order to keep the
iterates bounded. In the next section, we present an on-line
algorithm based on a reformulation of this RVIA that does
not require the knowledge of $p(\cdot, \cdot, \cdot)$. The approach is similar
to that in [17]. We also augment the reformulated RVIA with
an LM iteration to ensure constraint satisfaction.

² This is a special construct that is possible when the controlled transition
naturally splits into the control action followed by the action of noise. It has
the advantage that the corresponding dynamic programming equation has the
conditional expectation operation outside the minimization operation. This
facilitates a stochastic approximation version of it, which is the learning
algorithm. See [34] for an extensive account of the post-decision state
formalism.

3) The On-line Rate Allocation Algorithm: In this section, 
we present the on-line algorithm which we call the rate 
allocation algorithm. This algorithm determines the rate with 
which the user should transmit in a slot and updates the value 
function and LM. 

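Let $\{f_n\}$ and $\{e_n\}$ be two sequences that have the following
properties:

$$\sum_n (f_n)^2 < \infty, \quad \sum_n (e_n)^2 < \infty, \tag{19}$$

$$\sum_n f_n = \infty, \quad \sum_n e_n = \infty, \tag{20}$$

$$\lim_{n \to \infty} \frac{e_n}{f_n} = 0. \tag{21}$$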


The significance of these properties has been explained below. 
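One pair of sequences that meets the three conditions is $f_n = n^{-0.6}$ and $e_n = n^{-1}$. This particular choice is an assumption made for illustration and is not taken from the paper; any exponent in $(0.5, 1)$ for $f_n$ paired with $e_n = 1/n$ would behave the same way. The Python sketch below evaluates them.

```python
def f(n: int, exponent: float = 0.6) -> float:
    """Faster step size, used for the value-function iteration."""
    return 1.0 / n ** exponent


def e(n: int) -> float:
    """Slower step size, used for the Lagrange multiplier iteration."""
    return 1.0 / n


# The ratio e_n / f_n = n ** (-0.4) shrinks toward zero as n grows,
# which is what separates the two time scales.
ratios = [e(n) / f(n) for n in (10, 1_000, 1_000_000)]
```

Both sequences are square summable because their squared exponents exceed 1, and neither is summable because both exponents are at most 1.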
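Let user $i$'s state at the beginning of slot $n$ be $S_n^i = (q, x)$.
Suppose that $u$ packets are transmitted in slot $n$, so that the
post-decision state is $(\tilde{q}, \tilde{x}) = (q - u, x)$. The following
primal-dual algorithm can be used
to compute the rate $R_{n+1}^i = r_{n+1}^i$ at which the transmitter
should transmit in slot $n + 1$:

$$r_{n+1}^i = \arg\min_{v} \Big\{ (1 - f_n) V_n^i(\tilde{q}, \tilde{x}) + f_n \times \big\{ c(\lambda_n^i, \tilde{q} + A_{n+1}^i, X_{n+1}^i, v) + V_n^i(\tilde{q} + A_{n+1}^i - v, X_{n+1}^i) - V_n^i(q^0, x^0) \big\} \Big\}, \tag{22}$$

$$V_{n+1}^i(\tilde{q}, \tilde{x}) = (1 - f_n) V_n^i(\tilde{q}, \tilde{x}) + f_n \times \big\{ c(\lambda_n^i, \tilde{q} + A_{n+1}^i, X_{n+1}^i, r_{n+1}^i) + V_n^i(\tilde{q} + A_{n+1}^i - r_{n+1}^i, X_{n+1}^i) - V_n^i(q^0, x^0) \big\}, \tag{23}$$

$$\lambda_{n+1}^i = \Gamma\big[ \lambda_n^i + e_n (Q_n^i - \delta^i) \big], \tag{24}$$

where $\Gamma[\cdot]$ in (24) is a projection operator that projects the
$\lambda^i$ iterates onto the interval $[0, \mathcal{K}]$ for $\mathcal{K} \gg 1$. This is to
ensure boundedness of these iterates. Components other than
the $(\tilde{q}, \tilde{x})$th in (23) are left unchanged. These equations are
explained below: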

1) (22), (23) and (24) constitute the rate allocation al- 
gorithm or the 'user algorithm'. It consists of two 
phases: rate determination phase and update phase. (22) 
constitutes the rate determination phase of the algorithm, 
i.e., it is used to determine the rate at which a user 
should transmit in a slot. (23) is a primal iteration to 
determine the optimal value function for post-decision 
states and thereby the optimal policy, while (24) is a 
coupled dual iteration for determining the optimal LM. 
They constitute the update phase of the algorithm. 

2) The sequences $\{f_n\}$ and $\{e_n\}$ have the properties specified in
(19), (20) and (21). The properties of the update sequences
in (19) ensure that the sequences $\{f_n\}$ and $\{e_n\}$
converge to zero sufficiently fast to eliminate the noise
effects when the iterates are close to their optimal values
$V^i(\cdot, \cdot)$ and $\lambda^{i,*}$, while those in (20) ensure that they do
not approach zero too rapidly, which avoids convergence of
the algorithm to non-optimal values. Furthermore, (21)
ensures that the update rates of primal, i.e., the value
function iterations and the dual, i.e., the LM iterations
are different. Since $e_n$ approaches zero much faster than
$f_n$, the update rate of the value function iterations is
much higher than that of the LM iterations. Using a two
time scale analysis, we show in Section V-A that even
though both the primal and dual variables are updated
simultaneously, both converge to their optimal values.
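The two phases of the user algorithm can be sketched in Python as a single per-slot step. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the value function is indexed by post-decision states and the minimized expression includes the continuation value of the next post-decision state (the usual relative-value form), the power model is $(2^r - 1)/x$, the step sizes are $n^{-0.6}$ and $n^{-1}$, and a constant `max_rate` stands in for the peak-power limit. The class and parameter names are hypothetical.

```python
from collections import defaultdict


def power(x, r):
    # Assumed convex power model; the paper only requires convexity in r.
    return (2.0 ** r - 1.0) / x


class UserLearner:
    """One user's rate-allocation learner (hypothetical helper class)."""

    def __init__(self, delta, max_rate, ref_state=(0, 1.0), lam_max=100.0):
        self.delta = delta          # queue-length (delay) target
        self.max_rate = max_rate    # stand-in for the peak-power rate limit
        self.ref_state = ref_state  # reference post-decision state
        self.lam_max = lam_max      # projection bound for the multiplier
        self.V = defaultdict(float) # value function on post-decision states
        self.lam = 0.0
        self.n = 1

    def step(self, post_state, arrivals, channel, queue_len):
        """Return the rate to request next; update V and the multiplier."""
        f_n = 1.0 / self.n ** 0.6   # fast step size (value function)
        e_n = 1.0 / self.n          # slow step size (multiplier)
        q_new = post_state[0] + arrivals
        old_value = self.V[post_state]
        ref_value = self.V[self.ref_state]

        # Rate determination phase: minimize the sampled update target.
        best_rate, best_value = 0, float("inf")
        for v in range(min(q_new, self.max_rate) + 1):
            cost = power(channel, v) + self.lam * (q_new - self.delta)
            target = cost + self.V[(q_new - v, channel)] - ref_value
            value = (1.0 - f_n) * old_value + f_n * target
            if value < best_value:
                best_rate, best_value = v, value

        # Update phase: primal (value) step, then projected dual step.
        self.V[post_state] = best_value
        self.lam = min(max(self.lam + e_n * (queue_len - self.delta), 0.0),
                       self.lam_max)
        self.n += 1
        return best_rate


learner = UserLearner(delta=2.0, max_rate=2)
requested = learner.step(post_state=(0, 1.0), arrivals=3, channel=1.0, queue_len=3)
```

In this sketch `queue_len` is the actual queue length, which depends on whether the base station scheduled the user, whereas the value update proceeds as if the requested rate had been served. That separation mirrors the coupling through the multiplier described in the text.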
Remark 2: Note that the r.h.s. of (23) is nothing but the 
actual minimum of the expression being minimized on the 
r.h.s. of (22). That is, the relative value iteration is being 
performed for the single user cost exactly as in the single 
user case of [17] specified by (9), not for the power cost (4). 
As we argue later, it will in fact converge to the single user 
optimum rate required for the current quasi-static value of the
Lagrange multiplier. This is the rate the user will request, it is 
not necessarily the rate actually allocated to it. The difference 
with the single user case of [17] comes from the fact that 
these relative value iterations for users are coupled through the 
Lagrange multiplier updates which are indeed affected by the 
actual transmission scheme. As we see later, the convergence 
of (24) perforce implies that the constraints are met. Given 
the fact that not the full requested transmission rate is actually 
granted, that leads to the conclusion that a higher rate will be 
requested than in the single user case in [17], which in turn 
means larger limiting values for the $\lambda$'s compared to those in
[17]. We call this the multi-user penalty that a user pays in 
the multi-user environment. Further discussion on this can be 
found in Section V-D1.

A convergence analysis of the user algorithm in (22), (23) 
and (24) is presented in Section V-A. Each user i determines 
the rate $R^i_{n+1} = r^i_{n+1}$ at which it would transmit in slot
n + 1 if it is scheduled, specified by the algorithm described 
above, and communicates this rate to the base station. The 
base station employs the user selection algorithm to schedule 
a user. The users update their value functions in each slot 
assuming successful transmission regardless of whether they 
are scheduled by the base station or not. Note, however, 
that for the user who is actually scheduled, a corresponding 
queue transition occurs because of the transmitted packets. 
This influences the queue length and consequently the LM 
update. 

B. User Selection Algorithm 

The user selection algorithm schedules the user with the
largest $R^i_n$. If more than one user has the largest rate, then
one of them is selected uniformly at random.
The intuition behind selecting the user with the
largest rate is the following. The rate allocation algorithm of
a user $i$ would direct it to transmit at a high rate $R^i_n$ under
two circumstances: either the channel condition for that user is
very good, in which case transmission at a high rate saves power,
or the queue length of that user is large. Thus selecting a user
with a high rate results in either power savings or a reduction in
the queue length of that user, which helps ensure that the
delay constraint of that user is satisfied. An analysis of the
user selection algorithm is presented in Section V-C1.

Fig. 2. Solution schematic
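The selection rule described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_user` and the representation of the bids as a plain list indexed by user are hypothetical and do not come from the paper.

```python
import random

def select_user(rates, rng=random):
    """Return the index of the user to schedule in the current slot.

    rates: requested rates, one entry per user (hypothetical input format).
    The user with the largest requested rate wins; ties are broken
    uniformly at random, as described in the text.
    """
    if not rates:
        raise ValueError("at least one user must report a rate")
    best = max(rates)
    # Collect every user that bids the maximum rate.
    candidates = [i for i, r in enumerate(rates) if r == best]
    return rng.choice(candidates)
```

For example, `select_user([2, 5, 3])` returns 1, while for `[4, 1, 4]` users 0 and 2 are each chosen with probability one half.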

C. Algorithm Details and Implementation 

The rate allocation algorithm is implemented on the user 
devices while the user selection algorithm is implemented at 
the base station, as illustrated in Figure 2. From (22), note that
the rate determination phase requires knowledge
of the channel state as perceived at the base station. The communication
overhead incurred by the base station in informing a user about 
the channel state perceived by it depends on the number of 
states used to represent the channel. We represent the channel 
using 8 states. Thus the base station needs 3 bits per slot in 
order to inform a user about the channel state perceived by it. 
The users inform the base station about the computed rates. 
We allocate 3 bits for conveying this information, i.e., the 
system can employ 8 rates. The user selection algorithm then 
determines the user to be scheduled and all users are informed 
about this decision. The rate allocation algorithm at each user 
then enters the update phase where the value function and the 
LM for each user are appropriately updated using (23) and 
(24). The algorithm continues in this manner in each slot $n$. The rate
allocation algorithm that is executed at each user device is
given in Algorithm 1, where steps 8-13 represent the
rate determination phase and steps 19-24 represent the
update phase. The user selection algorithm executed at the
base station is detailed in Algorithm 2.

D. Discussion 

1) Computational complexity: The computational complex- 
ity of the rate allocation algorithm executed at a user 
device is independent of the number of users in the 
system. This is because the rate allocation algorithm 
for any user i is dependent on the user i state only 
and is independent of the states of the other users. The 
user selection algorithm has to determine the maximum 
of N numbers and hence is linear in N . Thus the 
computational complexity of the user selection algorithm 
grows only linearly with the number of users. 

2) An auction interpretation: The above scheme can be 
interpreted as an auction, where the user selection al- 
gorithm auctions each time slot. The users bid in the 



 1: Initialize the value function matrix $V^i(q, x)$, $\forall q \in \mathcal{Q}$, $x \in \mathcal{X}$
 2: Initialize the LM $\lambda^i$
 3: Initialize slot counter $n \leftarrow 1$
 4: Initialize queue length $q^i$
 5: Initialize channel states $x' \leftarrow 0$, $x^i \leftarrow 0$
 6: Reference state $= (0, x^0)$, where $x^0 \in \mathcal{X}$
 7: while TRUE do
 8:     while Base station has not informed the channel state $X^i_{n+1}$ do
 9:         wait
10:     end while
11:     Determine the number of arrivals $A^i_{n+1} = a^i$ in the current slot
12:     Determine the queue length $Q^i_{n+1}$ in the current slot
13:     Use the rate determination phase of the rate allocation algorithm, i.e., (22), to determine the rate $r^i_{n+1}$ for transmission
14:     Determine the power required to transmit $r^i_{n+1}$ packets
15:     Inform the base station of the rate
16:     while Base station has not scheduled a user do
17:         wait
18:     end while
19:     Update the component $(q^i, x^i)$ of the value function using (23). Rest of the components remain unchanged
20:     Update the LM $\lambda^i$ using (24) ($Q^i_n = q^i$)
21:     $n \leftarrow n + 1$
22:     $q^i \leftarrow q'$
23:     $x^i \leftarrow x'$
24: end while

Algorithm 1: The Rate Allocation Algorithm at the User $i$ Device



while TRUE do
    for $i \in \{1, \ldots, N\}$ do
        Estimate the channel state $X^i_{n+1} = x^i$ in the current slot for user $i$
        Inform $x^i$ to user $i$
    end for
    while Rate of each user is not known do
        wait
    end while
    Determine the user $k$ who has the highest rate
    Schedule user $k$ in the current slot
end while

Algorithm 2: The User Selection Algorithm at the Base Station



form of their transmission rates to the user selection 
algorithm, which allocates the time slot to the user 
bidding the highest rate. The rate bid by a user is 
dependent on its channel state and queue length. If the 
channel state is good and/or the queue length is large, 
the user bids a high rate. This is because transmitting 
at a high rate when the channel state is good saves
power, while doing so when the queue length is large
aids in satisfying the delay constraint. Note that the users do not
bid unnecessarily high rates because that would result in
higher power consumption. For a user, not winning an
auction in a certain slot implies that other users either
have better channel conditions or relatively higher queue
lengths or both. If a user does not win the auction for a
certain number of successive slots, its queue length
grows, thus forcing it to bid a higher rate.

V. Analysis of the Multi-user Scheme 

In this section, we present an analysis of the multi-user 
algorithm presented in the paper. Specifically, we comment on 
queue stability, convergence and optimality of the algorithm. 
We begin with certain conditions that are necessary for the 
convergence of the algorithm.

Let $\varsigma(s^i, n)$ be the number of times that the state $s^i$ of user $i$
is visited up to time $n$. The state process should satisfy the
following property:

$$\liminf_{n \to \infty} \frac{\varsigma(s^i, n)}{n} > 0, \quad \forall s^i, i. \tag{25}$$

Condition (25), which is akin to positive recurrence for uncontrolled
Markov chains, is a stability condition and is essential for 
proving the convergence of our algorithm. We shall prove that 
the queues are indeed stable, implying that (25) is indeed 
satisfied, if we assume that the users have already learnt 
their policies, i.e., the user learning component of the overall 
scheme has converged. On the other hand, this convergence in 
turn requires a priori proof of stability. This circular situation is 
very common in adaptive and learning control (see, e.g., [35]) 
and a procedure that assures its resolution is yet unknown. 
There are however certain heuristic approaches that work well 
in practice. In the present case, for example, one can impose 
an initial 'pure learning' phase wherein one employs a known 
stable strategy for a finite duration instead of the 'self-tuning' 
policy which bases its decision on the current guess for the 
value function, and then switches to the latter. This will ensure 
that the latter, when finally inducted, gives good guesses for 
the value functions so that the corresponding guesses for 
actions are close to optimal, in particular stable. Another 
possible solution is to use a known stabilizing policy when
the state process blows up, i.e., crosses a very large threshold, 
and use the policy proposed herein otherwise. Our simulations, 
however, use the original scheme and the simulation results are 
quite promising. 
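The two safeguards mentioned above (an initial 'pure learning' phase and a fallback when the queue crosses a large threshold) can be combined in a small wrapper. The sketch below is illustrative only: the names `safeguarded_rate`, `warmup_slots` and `queue_threshold`, and the representation of policies as functions of (queue length, channel state), are assumptions introduced here, not part of the scheme as simulated in the paper.

```python
def safeguarded_rate(queue_len, channel, slot,
                     learned_policy, stabilizing_policy,
                     warmup_slots, queue_threshold):
    """Pick the rate to request in this slot.

    learned_policy / stabilizing_policy: hypothetical callables mapping
    (queue_len, channel) to a rate. The stabilizing policy is assumed to be
    known in advance to keep the queue stable.
    """
    # Heuristic 1: pure learning phase, use the known stable strategy.
    if slot < warmup_slots:
        return stabilizing_policy(queue_len, channel)
    # Heuristic 2: fall back when the queue crosses a very large threshold.
    if queue_len > queue_threshold:
        return stabilizing_policy(queue_len, channel)
    # Otherwise use the self-tuning policy based on the current value function.
    return learned_policy(queue_len, channel)
```

In either fallback case the value function and LM updates can continue in the background, so that learning is not interrupted; whether to do so is a design choice left open here.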

A. Convergence Analysis 

Note that each user in the system has coupled iterations
comprising the value function and the LM. We need to prove
that these coupled iterations converge to an equilibrium set for
each user. The main steps in the analysis are as follows:

1) We make an assumption on the stability of the queues
under the closed loop scheme, as noted above.

2) We then analyze the convergence of the value function
for each user for an almost constant value of the LM.
Since the value function of each user is updated in each
slot regardless of whether the user is scheduled in that
slot or not, the value functions are decoupled across
users. This is in the spirit of the decoupling of static
formulations of network flow problems via the Lagrange
multiplier as in [36], except that here it is mandated by
our algorithm. The decoupling is facilitated by the fact
that the users compute their value function as though the
cost is (9) and not (4). The latter would have introduced
a more direct dependence on other users, over and above
that through the LM.

3) Finally, we prove that the LMs and the coupled iterates
of the users also converge. Convergence of the LMs
implies that the delay constraints are satisfied and vice
versa. This implies that if there is sufficient capacity,
the multi-user scheduling satisfies the delay constraints
of all the users.
We now prove that the value functions converge. 

Theorem 1: For the rate determination algorithm (22), (23)
and (24), the iterates $(V^i_n, \lambda^i_n)$ converge to the optimal values,
i.e., $(V^i_n, \lambda^i_n) \to (V^i, \lambda^{i,*})$. Moreover, convergence to
equilibrium implies that the delay constraints of the users are
satisfied.

Proof: The proof for individual user's algorithm is similar 
to that of the single user algorithm in [17]. We sketch it 
in outline below, referring to the relevant portions of [37] 
for details. The arguments are based on the well known 
'o.d.e.' approach for the analysis of stochastic approximation 
algorithms, wherein one looks at the algorithm as a noisy discretization
of a limiting ordinary differential equation (o.d.e.)
which can be written by inspection. One treats the 'learning
parameters' or stepsizes as discrete time steps and compares
the linearly interpolated iterates with the o.d.e. trajectory from
some time on. The assumptions such as (19) and (20) on the
stepsizes ensure (under suitable hypotheses) that errors due to
both discretization and noise are asymptotically negligible and
therefore the iterates a.s. track the asymptotic behavior of the
o.d.e. See Chapter 2 of [37] for the general idea of the proof. We
spell out some details below that are specific to this paper.
1) Our requirement that $e_n / f_n \to 0$ induces two time scales, a
fast one for (23) and a slow one for (24). Let $h^i(V^i) = [h^i_{q,x}(V^i)]$, with

$$h^i_{q,x}(V^i) = \sum_{a, x'} \zeta(a)\, \kappa^i(x' \mid x) \times \min_u \big[ c(\lambda^i, q + a, x', u) + V^i(q + a - u, x') - V^i(q^0, x^0) \big],$$

where $(q^0, x^0)$ is any pre-designated state. Using the two
time scale analysis in [37], Section 6.1, we first analyze
(23) by freezing $\lambda^i_n \approx$ a constant $\lambda^i$ and considering the
limiting o.d.e. for (23) given by

$$\dot{V}^i(t) = \Lambda^i(t) \big( h^i(V^i(t)) - V^i(t) \big), \tag{26}$$

where $\Lambda^i(t)$ is a diagonal matrix with nonnegative
elements summing to 1 on the diagonal. The diagonal
elements of $\Lambda^i(t)$ reflect the relative frequency with
which the system states $(q, x)$ are visited, and therefore
the corresponding components of the iteration are
updated. Barring the diagonal matrix $\Lambda^i(t)$, this is just
what one expects. The occurrence of $\Lambda^i(t)$ is due to the
asynchronous iteration: we update only one component
of $V^i$ at a time. See [37], Section 7.2, for a description
of how this comes about.

2) Suppose that the queues remain stable. Coupled with the
recurrence of the Markov chain describing the channel
state, this implies that the empirical frequency of any
possible post-decision state value remains bounded away
from zero with probability one. This implies that the relative
frequencies with which the states $(q, x)$ are visited
are of the same order of magnitude. Hence, the diagonal
elements of $\Lambda^i(t)$ remain uniformly bounded away from
zero (see [37], p. 87, for a more formal argument). One
also needs that a particular user is scheduled frequently
enough by the base station. Our stability assumption will
be seen to ensure this automatically: a user not selected
for long builds up a large queue and hence requests a
higher rate, which favors its selection thereafter. Were
this not so, that queue would have become unstable.
In fact this can still happen if the maximum permissible
rate for one user is far less than that for others, in which
case it will be starved of transmission opportunities
most of the time and go unstable. We assume that the
maximum allowable rates are sufficiently large for the
arrival rates under consideration and comparable (among
users) so that this does not happen. In fact, we later
impose a condition ((30) below) stronger than this.

3) We now prove that the value function converges to its
optimal value $V^i$, i.e., $V^i_n \to V^i$.

Lemma 1: If the diagonal elements of $\Lambda^i(t)$ remain
uniformly bounded away from zero, $V^i_n \to V^i$.

Proof: We first need to prove that the iterates
remain bounded a.s. We adapt the arguments of Section
3.2, [37]. The situation is slightly different here, viz.,
we have a time-dependent matrix $\Lambda^i(t)$ on the r.h.s.
of (26) whereas Section 3.2 of [37] analyzes a time-homogeneous
case (compare (26) in the present paper
and (3.2.1) in [37]). Nevertheless, it does not affect the
argument as long as the diagonal elements of $\Lambda^i(t)$ are
uniformly bounded away from zero. As in Section 3.2 of
[37] (especially the development after (3.2.2)), consider
a 'scaled limit' of the o.d.e. (26) wherein in the r.h.s.
of (26) the function $h^i$ is replaced by $h^i_\infty$ defined by
$h^i_\infty(x) = \lim_{a \uparrow \infty} h^i(ax)/a$. This corresponds to (26) again,
but with $c(\cdot) \equiv 0$, i.e., the immediate cost function is set
to zero. It can be shown as in [17], Lemma 1, that this
scaled o.d.e. has the origin as the globally asymptotically
stable equilibrium. This ensures a.s. boundedness of
the iterates by the results of Section 3.2 of [37]. The
intuition is as follows: if the iterates become unbounded
along a subsequence, a scaled version thereof, suitably
interpolated, begins to approximate the limiting o.d.e.
above and therefore has to return towards the origin by
the asymptotic stability of the limiting o.d.e. Since these
differ from the original iterates only by a scale factor,
one can argue that the original iterates themselves start
moving towards a bounded set and therefore cannot blow
up. Section 3.2 of [37] makes this intuition precise.
One can then argue as in [17] (development just before
Lemma 1) to conclude that $V^i(t)$ converges to the
solution $V^i$ corresponding to $V^i(q^0, x^0) = \beta$, where
$\beta$ is the optimal average cost per stage. As in Section
2.1, [37], it then follows that $V^i_n \to V^i$ a.s. ■

4) Note that the above analysis treats $\lambda^i_n \approx$ a constant,
so what we have really proved (cf. Section 6.1, [37])
is that $\{V^i_n\}$ closely tracks $\{V^{i,\lambda^i_n}\}$, where $V^{i,\lambda}$
is $V^i$ with its $\lambda$-dependence made explicit. To be
precise, $V^i_n - V^{i,\lambda^i_n} \to 0$ a.s. The following lemma
then proves that the LM converges to its optimal value
$\lambda^{i,*}$, and hence the pair $(V^i_n, \lambda^i_n)$ converges to $(V^i, \lambda^{i,*})$.

Lemma 2: The LM iterates $\lambda^i_n$ converge to the optimal
value $\lambda^{i,*}$.

Proof: The proof is similar to that in [17], Lemma
4 and Corollary 1. Note that the $V^i_n$ and $\lambda^i_n$ iterations
are primal-dual iterations. The primal iterations perform
relative value iteration and determine a minimum of
the Lagrangian with respect to the policy for an almost
constant LM. The limiting o.d.e. for the $\lambda^i_n$'s is a steepest
ascent for the Lagrangian minimized over the primal
variables (see Lemma 4 of [17]), a fact that can be
justified by using the 'envelope theorem' [38], which
allows one to interchange the 'min' and the gradient
operator. By standard results for stochastic gradient
ascent for concave functions (see Section 10.2 of [37]),
this converges to the optimal LM $\lambda^{i,*}$ as argued in [17],
Lemma 4 and Corollary 1. ■
Lemmas 1 and 2 imply that $(V^i_n, \lambda^i_n) \to (V^i, \lambda^{i,*})$ as
required.

The LM iteration can be considered to be a noisy
discretization of a limiting o.d.e. which is driven by
the shortfall from, or the excess over, the allowed mean
traffic of the actual mean traffic. Hence it is clear that
the convergence of the LM to an equilibrium of this o.d.e.
implies that the constraints are met with equality.
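To make the two time scale idea concrete, the following toy example runs a noisy primal-dual iteration with a fast primal stepsize and a slow dual stepsize. It is not the iteration (22)-(24) of this paper: the scalar problem (minimize $x^2/2$ subject to $x \ge c$), the stepsize choices $n^{-0.6}$ and $n^{-1}$, the Gaussian noise, and the function name are all assumptions made purely for illustration. The saddle point of this toy problem is $x = \lambda = c$.

```python
import random

def two_timescale_demo(c=2.0, n_iter=100_000, noise=0.5, seed=0):
    """Toy primal-dual stochastic approximation on two time scales.

    Lagrangian (assumed toy problem): L(x, lam) = 0.5 * x**2 + lam * (c - x).
    """
    rng = random.Random(seed)
    x, lam = 0.0, 0.0
    for n in range(1, n_iter + 1):
        fast = 1.0 / n ** 0.6   # plays the role of the primal stepsize
        slow = 1.0 / n          # plays the role of the dual stepsize; slow/fast -> 0
        # Fast primal step: noisy gradient descent in x for the current lam.
        x -= fast * ((x - lam) + noise * rng.gauss(0.0, 1.0))
        # Slow dual step: noisy gradient ascent in lam, projected onto lam >= 0.
        lam = max(0.0, lam + slow * ((c - x) + noise * rng.gauss(0.0, 1.0)))
    return x, lam
```

Because the dual stepsize is asymptotically negligible relative to the primal one, the primal variable sees an almost constant multiplier, while the multiplier sees an almost equilibrated primal variable. Both iterates are updated in every step, and both settle near $c$.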



B. Queue Stability 

In this sub-section we prove (under suitable conditions) 
that under the strategies learnt by individual users, the queues 
do remain stable. This is done using a stochastic Lyapunov 
argument. This does, however, presuppose that the users have 
learnt the correct strategies already, i.e., the learning scheme 
has already converged. As noted earlier, this does not prove
closed loop stability with the learning scheme. Our simulation 
results, however, indicate a stable behavior even with the 
learning component kicking in. 

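Let $V(\mathbf{q}) = \sum_i q^i$ and $\gamma^i = E[A^i_n]$. Let $\pi$ denote the (unique)
stationary distribution of the Markov chain $\{\mathbf{X}_n\}$. Let

$$R = \min_i \sum_{x^i} \pi(x^i)\, \bar{R}^i(x^i).$$

Recall that $R^i_n$ depends on $(Q^i_n, X^i_n)$. Suppose that
$R^i_n = \ell^i(Q^i_n, X^i_n)$ for some $\ell^i(\cdot,\cdot)$. Write $\ell(\mathbf{Q}_n, \mathbf{X}_n)$ for
the vector $[\ell^1(Q^1_n, X^1_n), \ldots, \ell^N(Q^N_n, X^N_n)]$. Then

$$E[V(\mathbf{Q}_{n+1}) \mid \mathbf{Q}_n, \mathbf{X}_n] - V(\mathbf{Q}_n) = E\Big[\sum_i (Q^i_{n+1} - Q^i_n) \,\Big|\, \mathbf{Q}_n, \mathbf{X}_n\Big] = -\max_i \ell^i(Q^i_n, X^i_n) + \sum_i \gamma^i. \tag{27}$$

If $Q^i_n \uparrow$ then $\ell^i(Q^i_n, X^i_n) \uparrow$ by monotonicity of the optimal single
user policy in the queue length [10]. It is reasonable to expect
that as $q^i \uparrow \infty$,

$$\ell^i(q^i, x^i) \uparrow \bar{R}^i(x^i). \tag{28}$$

This motivates our assumption:

$$\max_i \ell^i(q^i, x^i) \ge R \tag{29}$$

outside a bounded set of $\mathbf{q}$'s. That is, if queue length(s) blow
up, the system at the very least transmits at rate $R$. We also
assume:

$$\sum_i \gamma^i < R. \tag{30}$$

This implies that whenever $\mathbf{Q}_n$ is outside a certain large
bounded set,

$$E[V(\mathbf{Q}_{n+1}) \mid \mathbf{Q}_n, \mathbf{X}_n] - V(\mathbf{Q}_n) \le -R + \sum_i \gamma^i < 0. \tag{31}$$

Thus $V(\cdot)$ serves as a stochastic Lyapunov function for the
queue, thus ensuring its stability (i.e., positive recurrence; see
[39], Chapter 13). (Note that it suffices to have the Lyapunov
function depend on queue lengths alone, as the channel state
is in any case finite valued.)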

The condition (30) ensures that the minimum of the maximum
rate allowed to each user is itself adequate to handle
all traffic. Looking at the aggregate queue length process
$Q_n := \sum_i Q^i_n$, which satisfies

$$Q_{n+1} = Q_n - \max_i R^i_n + \sum_i A^i_{n+1},$$

it is clear that such a condition would be sufficient. Nevertheless
it is rather strong and we suspect that it can be
significantly weakened.
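The role of condition (30) can be illustrated with a small simulation of an aggregate queue that is drained at a fixed rate whenever backlog is available. This is a deliberately simplified sketch: Bernoulli arrivals, a deterministic service rate standing in for $R$, and the function name are assumptions made here and are not the traffic or channel models of the paper.

```python
import random

def simulate_aggregate_queue(arrival_probs, service_rate, n_slots=20_000, seed=0):
    """Simulate Q <- max(Q - service_rate, 0) + (arrivals in the slot).

    arrival_probs[i]: probability that user i receives one packet in a slot,
    so the mean arrival rate of user i is arrival_probs[i] (assumed model).
    Returns (time-averaged queue length, final queue length).
    """
    rng = random.Random(seed)
    q = 0
    running_sum = 0
    for _ in range(n_slots):
        q = max(q - service_rate, 0)                            # departures
        q += sum(1 for p in arrival_probs if rng.random() < p)  # arrivals
        running_sum += q
    return running_sum / n_slots, q
```

With four users of arrival rate 0.2 each and a service rate of 1, the total arrival rate 0.8 is below the service rate and the average backlog stays small. Raising each arrival rate to 0.3 makes the total 1.2, the drift turns positive, and the backlog grows roughly linearly in the number of slots.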

Under (30), we can say more than just positive recurrence,
viz., we characterize the set near which the queue lengths will
concentrate, in an approximate way. For technical reasons, for
the purposes of this proof we assume that at each time $n$ the
base station uses a smooth approximation $F := [F^1, \ldots, F^N]$
of the maximum function for channel allocation, i.e., it allocates
the channel to the user with the highest demand (say, the $i$th)
with a probability $F^i(\mathbf{R}_n)$ close to one, but does allocate it
to others with a small but nonzero probability $F^j(\mathbf{R}_n)$, $j \ne i$.
Specifically, let $F^i(\mathbf{R}_n) := g(R^i_n)^m / \sum_j g(R^j_n)^m$ for some
monotone increasing and smooth $g$ and $m \gg 1$. Also, we
assume that the $\ell^i(\cdot,\cdot)$ above is continuously differentiable.
These hypotheses avoid some difficulties in the argument below
that would be caused by having to deal with a differential
equation with a discontinuous right hand side. This is at the
expense of being only approximate.
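A minimal sketch of such a smooth approximation of the maximum is given below. The particular choice $g(r) = 1 + r$ and the default $m = 50$ are assumptions made for illustration; the text only requires $g$ to be monotone increasing and smooth and $m$ to be large. Dividing by the largest $g$ value before raising to the power $m$ leaves the ratios unchanged and only guards against floating point overflow.

```python
def smooth_max_allocation(rates, m=50, g=lambda r: 1.0 + r):
    """Return allocation probabilities g(r_i)^m / sum_j g(r_j)^m.

    rates: requested rates, one per user (assumed nonnegative here so that
    the illustrative g is positive).
    """
    values = [g(r) for r in rates]
    top = max(values)
    # (g_i / top)^m has the same ratios as g_i^m but cannot overflow.
    weights = [(v / top) ** m for v in values]
    total = sum(weights)
    return [w / total for w in weights]
```

For rates `[2, 5, 3]` the user bidding 5 receives the slot with probability very close to one, while the other two users keep a small but strictly positive probability, which is exactly the smoothing the argument relies on.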

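The queue evolution (2) for user $i$ can be rewritten as:

$$Q^i_{n+1} = Q^i_n + \big(\gamma^i - F^i(\mathbf{R}_n) R^i_n\big) + \big((A^i_{n+1} - \gamma^i) + (F^i(\mathbf{R}_n) R^i_n - U^i_n)\big).$$

Let $\tilde{F} = [\tilde{F}^1, \ldots, \tilde{F}^N]$ be defined by: $\tilde{F}^i(\mathbf{r}) = F^i(\mathbf{r})\, r^i$ $\forall i$.
Note that this is continuously differentiable. Then the queue
evolution is of the form:

$$Q^i_{n+1} = Q^i_n + \big(\gamma^i - \tilde{F}^i(\mathbf{R}_n)\big) + M^i_{n+1}, \quad n \ge 0, \tag{32}$$

where $\{M_n\}$ is a martingale difference sequence.
Let $\mathbf{q} = [q^1, \ldots, q^N]$. Define $\bar{F} = [\bar{F}^1, \ldots, \bar{F}^N]$ by:

$$\bar{F}^i(\mathbf{q}) = \sum_{\mathbf{x}} F^i(\ell(\mathbf{q}, \mathbf{x}))\, \ell^i(q^i, x^i)\, \pi(\mathbf{x}).$$

Consider the scaled version of (32), given by

$$Q^i_{n+1} = Q^i_n + \eta \big(\gamma^i - \tilde{F}^i(\mathbf{R}_n)\big) + \eta M^i_{n+1}, \quad n \ge 0, \tag{33}$$

for a small $\eta > 0$. If we consider smaller and smaller time
slots of width $\eta$ with $\gamma^i$ and $\tilde{F}^i$ being 'rates per unit time'
rather than 'per slot' quantities, then we obtain the 'fluid'
approximation of (32), which is given by the o.d.e.

$$\dot{q}^i(t) = \gamma^i - \bar{F}^i(\mathbf{q}(t)), \tag{34}$$

restricted to the positive orthant (i.e., it follows the projected
dynamics on the boundary of this orthant).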

Lemma 3: The trajectories of the o.d.e. in (34) for each $i$
approach a bounded set.

Proof: The proof is similar to the stochastic Lyapunov
argument above (refer to (31)). Let $\Omega_n = \{\text{maximizers of } R^i_n\}$.
For $m \gg 1$, $F^i(\ell(\mathbf{Q}_n, \mathbf{X}_n)) \approx I\{i \in \Omega_n\} / |\Omega_n|$. Let
$V(\mathbf{q}) = \sum_i q^i$. Thus

$$\dot{V}(\mathbf{q}(t)) = \sum_i \gamma^i - \sum_i \bar{F}^i(\mathbf{q}(t)) \approx \sum_i \gamma^i - \sum_{\mathbf{x}} \pi(\mathbf{x})\, \ell^{i^*}\big(q^{i^*}(t), x^{i^*}\big),$$

where $i^* = i^*(t, \mathbf{x})$ is the maximizer of $i \mapsto \ell^i(q^i(t), x^i)$, i.e.,
$i^*(t, \mathbf{x})$ is the index of the user that bids the highest rate.
If any $q^i(t)$ becomes very high, then one expects the users
to transmit at the highest possible rate in all channel states, and
the system at the very least transmits at rate $R$. Hence, r.h.s.
$\le \sum_i \gamma^i - R$, thus leading to

$$\dot{V}(\mathbf{q}(t)) \le \sum_i \gamma^i - R < 0.$$

Thus $V(\cdot)$ serves as a Lyapunov function for the o.d.e., leading
to bounded trajectories. ■
Recall that an o.d.e. $\dot{x}(t) = f(x(t))$ is cooperative if $\partial f_i / \partial x_j > 0$
for $i \ne j$ (more generally, $\partial f_i / \partial x_j \ge 0$ and the Jacobian matrix with
diagonal elements replaced by zero is irreducible) [40].

Lemma 4: $\partial \bar{F}^i / \partial q^j < 0$ for $i \ne j$, i.e., (34) is a cooperative
dynamics.

Proof: From [7], [10], we know that the optimal policy
is increasing in queue length for the single user case with
i.i.d. channel fading and arrival processes. Thus, an increase in
the queue length of a user $j$ results in an increase in $R^j_n$
as computed by user $j$. As per our user selection policy at the
base station, this results in a decrease in the probability of channel
allocation to any user $i \ne j$, and hence in $\bar{F}^i(\cdot)$ for user $i$. ■
We thus have the following lemma. 

Lemma 5: The trajectories of the o.d.e. in (34) converge to 
an equilibrium for almost all initial conditions. 

Proof: Boundedness of the trajectories as proved in Lemma
3 and the cooperative property as proved in Lemma 4 allow
us to apply Hirsch's theorem (Theorem 4.1, p. 435, of [40]).
It follows that the trajectories converge for almost all initial 
conditions. ■ 
Lemma 6: With probability $1 - O(\eta)$, $\{\mathbf{Q}_n\}$ in (32) will
concentrate near the set of equilibria $B$ of (34).

Proof: As seen above, $V(\cdot)$ serves as a stochastic Lyapunov
function for the Markov chain $\{\mathbf{Q}_n\}$, implying its
positive recurrence. In particular, it will be asymptotically
stationary. Thus it suffices to consider the stationary chain.
Consider two time slices at $n$ and $n + M$ resp., $n, M > 0$, of
the stationary chain. The distribution $\mu_s$ of the queue lengths
is the same at both times by stationarity. Let $\eta > 0$ be as
above and pick a compact set $K$ such that $\mu_s(K) > 1 - \eta$.

Let $B^\epsilon$ denote the $\epsilon$-neighborhood of $B$. Then
$\mu_s(\cup_l \{\mathbf{q} : q(0) = \mathbf{q},\ q(t) \in B^\epsilon\ \forall t > l\}) = 1$ by Lemma 5.
Thus we can pick an $M > 1$ such that
$\mu_s(\{\mathbf{q} : q(0) = \mathbf{q},\ q(t) \in B^\epsilon\ \forall t > M\}) > 1 - \eta$.
Let $D = \{\mathbf{q} : q(0) = \mathbf{q},\ q(t) \in B^\epsilon\ \forall t > M\}$. Now
consider (33) as a constant stepsize stochastic approximation
iteration for small $\eta$. The convergence analysis for constant
stepsize stochastic approximation algorithms of Chapter 9,
[37], implies that

$$E\big[ d(\mathbf{Q}_{n + \lceil M/\eta \rceil}, B^\epsilon)^2 \mid \mathbf{Q}_n \in D \big] \le C\eta \tag{35}$$

for some $C > 0$, $d(\cdot, B^\epsilon)$ being the Euclidean distance from
$B^\epsilon$. It follows that under the stationary law,

$$P\big( d(\mathbf{Q}_{n + \lceil M/\eta \rceil}, B) > 2\epsilon \big) \le P\big( d(\mathbf{Q}_{n + \lceil M/\eta \rceil}, B^\epsilon) > \epsilon \big) = O(\eta),$$

where we use stationarity, (35) and the Chebyshev inequality
to get the above inequality. ■
The equilibria $\mathbf{q}^*$ of the limiting o.d.e. correspond to situations
where

$$\gamma^i = \bar{F}^i(\mathbf{q}^*) \quad \forall i, \tag{36}$$

i.e., the mean arrivals and departures balance out, as they
should. The discussion above then shows that the stochastic
behavior fluctuates around this. Solution of (36) for $m \gg 1$
thus allows us to make predictions about the queue behavior.



C. Some Optimality Properties 

We now prove that the base station user selection algorithm 
minimizes the long term sum of maximum rates. Moreover, 
we argue that the average power vector is a point on the Pareto 
frontier and hence it is Pareto optimal. Furthermore, we also 
argue that the scheme is 'fair' in the sense that users requiring 
larger system resources have to pay a higher 'price', i.e., they 
expend more power. 

1) User Selection Algorithm: The base station faces a
'restless bandit' problem [41] with partial observations, since
the base station has knowledge of $\mathbf{X}_n$ and not of $\mathbf{Q}_n$. This
problem is hard [42]. In the user selection policy suggested
by us, the base station employs a greedy index policy treating
the observed bids as indices. Here, we prove the optimality of
the user selection policy at the base station for a certain cost
criterion, given the user behavior.

Lemma 7: The base station user selection policy minimizes: 




Proof: Let $\Pi(\mathbf{q}, \mathbf{x})$ be the stationary distribution of the
joint queue and channel state under the user selection policy
proposed in the paper. Taking stationary expectations on both
sides of the recursion

$$\sum_i Q^i_{n+1} = \sum_i Q^i_n - \sum_i U^i_n + \sum_i A^i_{n+1}$$

allows us to write the following:

$$\sum_i \gamma^i = \sum_{\mathbf{q}, \mathbf{x}} \Pi(\mathbf{q}, \mathbf{x}) \max_i \big( \ell^i(q^i, x^i) \big). \tag{38}$$

Let $\tilde{\Pi}(\mathbf{q}, \mathbf{x})$ be the stationary distribution of the joint queue and
channel state induced by some other stabilizing policy that
does not necessarily schedule the user bidding the highest rate
in a slot. (It is assumed that the other policy is a stabilizing
policy, but then it is clear that unstable policies are not
contenders for optimality anyway.) Again, we have:

$$\sum_i \gamma^i = \sum_{\mathbf{q}, \mathbf{x}} \tilde{\Pi}(\mathbf{q}, \mathbf{x}) \sum_i \chi^i(\mathbf{q}, \mathbf{x}) \big( \ell^i(q^i, x^i) \big), \tag{39}$$

where $\chi^i(\mathbf{q}, \mathbf{x})$ is the probability of scheduling user $i$ in
state $(\mathbf{q}, \mathbf{x})$ under the other stabilizing policy; $\chi^i(\mathbf{q}, \mathbf{x}) \ge 0$,
$\sum_i \chi^i(\mathbf{q}, \mathbf{x}) = 1$. This implies that:

$$\sum_{\mathbf{q}, \mathbf{x}} \Pi(\mathbf{q}, \mathbf{x}) \big( \max_i ( \ell^i(q^i, x^i) ) \big) = \sum_{\mathbf{q}, \mathbf{x}} \tilde{\Pi}(\mathbf{q}, \mathbf{x}) \Big( \sum_i \chi^i(\mathbf{q}, \mathbf{x}) ( \ell^i(q^i, x^i) ) \Big) \le \sum_{\mathbf{q}, \mathbf{x}} \tilde{\Pi}(\mathbf{q}, \mathbf{x}) \big( \max_i ( \ell^i(q^i, x^i) ) \big). \tag{40}$$

Hence the policy minimizes (37) over all stable stationary
policies. From standard results from Markov decision theory
([30], Chapter 8), this implies the claim. ■
Remark 3: Note that $\ell^i(\cdot, \cdot)$ is user $i$'s policy, i.e., the
function that maps the state of user $i$ to the rate it bids,
which in turn determines its power expenditure. Thus the
minimization objective of the base station, which seeks to
minimize the average maximum rate requested per slot, aids in
conserving user power. This is not enough to show that power
consumption per user is minimized. However, we argue that
the scheme is Pareto optimal.

2) Pareto Optimality of Power Vector: Once the users begin
transmitting with a stable transmission power schedule, in order to
reduce the power consumption of say, user i, the base station 
has to reduce the average rate with which user i transmits. 
Since the delay constraint of user i must be satisfied, this can 
be done by increasing the fraction of slots allocated to that 
user. This results in decreasing the fraction of slots allocated 
to some other user(s), say, user j. Now, if user j has to satisfy 
its delay constraint, it has to increase the rate at which it 
transmits, thus increasing its power expenditure. Thus, once 
the system has stabilized, reduction in the power expenditure 
of one user is possible only at an expense of increase in the 
power expenditure of some other user(s). Therefore the long 
term power expenditure vector is a point on the Pareto frontier, 
i.e., the scheme is Pareto optimal. 

D. Game Theoretic Interpretation 

Here, we provide a game theoretic interpretation of the 
scheme and argue that it is incentive compatible. 

1) Multi-user Penalty: Consider a user $i$. Fix a Markov
policy learnt by users $j \ne i$ and the base station policy. As per
our scheme, user $i$ solves an average cost MDP with running
cost function $P(x, r) + \lambda^i (q - \bar{\delta}^i) = P(x, r) + \frac{\lambda^i}{\hat{\lambda}^i} \hat{\lambda}^i (q - \bar{\delta}^i)$,
where $\hat{\lambda}^i$ is the correct LM for the constrained MDP with the
objective of minimizing $\limsup_{M \to \infty} \frac{1}{M} \sum_{n=1}^{M} P(X_n, R_n)$
subject to $\limsup_{M \to \infty} \frac{1}{M} \sum_{n=1}^{M} Q_n \le \bar{\delta}^i$. This can be
interpreted as there being an appropriate 'scaling' of the LM
because of the presence of multiple users. This is the penalty of
operating in a multi-user environment. The upward scaling of
the LM would naturally result in a higher power consumption
as compared to the single user facing the aforementioned
constrained MDP. We call the ratio $\lambda^i / \hat{\lambda}^i$ the multi-user penalty.
It is clear that each user is optimal for this modified MDP
when the policies of other users and the base station are
fixed. On the other hand, we have already seen that the base
station uses a policy that is optimal for cost (37) when the
user policies are fixed. Thus this is a Nash equilibrium for a
certain stochastic dynamical game. In fact, it is an instance
of the economists' notion of a Markov equilibrium (see, e.g.,
Chapter 13 of [43]).

2) Incentive Compatibility of the Scheme: The users always 
attempt to keep their average queue lengths close to their 
respective queue length constraints, i.e., the constraint is 
satisfied with equality. This is because the user's objective is 
to minimize the power expenditure. By asking for a higher rate 
than what is required, a user might achieve an average queue 
length that is much lower than the queue length constraint. 
However, as proved in [7] for single user policy, since the 
power is an increasing convex function of the rate, the rate
should be kept as low as possible so that the average queue
length is as large as possible (in this case, equal to the queue 
length constraint) in order to save power. Thus users will 
transmit at a rate such that the average throughput achieved 
by them is just sufficient to meet the delay constraint with 
equality, implying that there is no incentive for the users to 
lie and ask for an unnecessarily high rate. This establishes the 
incentive compatibility of the scheme. 

3) Fairness: The scheme is 'fair' in the sense that users' 
power expenditure is commensurate with their fraction of 
time slots requirement. A certain user having a high arrival 
rate or stringent delay constraint or poorer channel condition 
requires higher fraction of slots in order to satisfy the delay 
requirement. Such a user must consistently bid higher rates in 
order to obtain a higher fraction of time slots and thus ends 
up 'paying' more in terms of its long term power expenditure. 

VI. Experimental Evaluation 

We demonstrate the performance of our algorithm under the 
IEEE 802.16 [1] framework through simulations in a discrete 
event simulator. Specifically, we intend to demonstrate the 
following: 

1) The algorithm satisfies the users' delay constraints. 

2) The algorithm is efficient in terms of the power con- 
sumed for each user. This is demonstrated by comparing 
the power consumed under our algorithm with that under 
M-LWDF scheduler [24]. 

3) Performance of the algorithm under different informa- 
tion accuracies. 

A. Simulation Setup 

We consider uplink (UL) transmissions in the residential 
scenario as in [44]. Internet traffic is modeled as a web traffic 
source [44], [45]. Variable sized packets are generated at the 
application layer. Packet sizes are drawn from a truncated 
Pareto distribution. This distribution is characterized by three
parameters: shape factor $\xi$, mode $v$ and cutoff threshold $g$.
The probability that a packet has a size $y$ can be expressed
as:

$$f_{TP}(y) = \frac{\xi\, v^{\xi}}{y^{\xi + 1}}, \quad v \le y < g,$$
$$f_{TP}(y) = \nu, \quad y \ge g, \tag{41}$$

where $\nu$ can be calculated to be equal to:

$$\nu = \left( \frac{v}{g} \right)^{\xi}, \quad \xi > 1. \tag{42}$$
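A truncated Pareto packet size of this kind can be drawn by inverse transform sampling of an ordinary Pareto variable followed by clipping at the cutoff. The sketch below is an illustration under stated assumptions: the function name and argument names are hypothetical, and clipping places all probability mass beyond the cutoff exactly at the cutoff, which gives the point mass $(v/g)^{\xi}$ of (42).

```python
import random

def sample_truncated_pareto(shape, mode, cutoff, rng=random):
    """Draw one packet size (same unit as mode and cutoff).

    shape: shape factor, mode: smallest possible size, cutoff: largest size.
    """
    u = 1.0 - rng.random()             # uniform on (0, 1], avoids division by zero
    y = mode / u ** (1.0 / shape)      # inverse CDF of the untruncated Pareto
    return min(y, cutoff)              # truncate: mass beyond cutoff sits at cutoff
```

For instance, with shape 1.2, mode 2000 bits and cutoff 10000 bits, about $(0.2)^{1.2} \approx 14.5\%$ of the draws land exactly on the cutoff.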

We choose shape factor $\xi = 1.2$, mode $v = 2000$ bits, cutoff
threshold $g = 10000$ bits, which provides us with an average
packet size of 3860 bits. In each time frame, we generate the
arrivals for all the users using Poisson distribution. Arrivals
are generated in an i.i.d. manner across frames. We divide the
packets into fragments at the MAC layer with each fragment
being of size $\tau = 2000$ bits. Fragments of size less than
2000 bits are padded with extra bits. Since all fragments are
of equal size, we determine the transmission rate for users 
in terms of number of fragments. We simulate a Rayleigh 
fading channel for each user. For a Rayleigh model, the channel 
state $X^i$ of user $i$ is an exponentially distributed random 
variable with mean $\alpha^i$ and probability density function 
$f_X(x) = \frac{1}{\alpha^i} \exp\left(-\frac{x}{\alpha^i}\right)$, $x \ge 0$. 
We assume that the power required for transmitting $z$ fragments 
of size $r$ bits when the channel state is $x$ is 
$P(x, zr) = \frac{W N_0}{x}\left(2^{zr/W} - 1\right)$, where $W$ 
is the bandwidth and $N_0$ is the power spectral density of the 
additive white Gaussian noise at the receiver. We assume that 
the product $W N_0$ is normalized to 1. We measure the sum 
of queuing and transmission delays of the packets and ignore 
the propagation delays. In all the scenarios described below, 
a single simulation run consists of running the algorithm for 
100000 frames and the results are obtained after averaging 
over 20 simulation runs. We discretize the channel into eight 
equal probability bins, with the boundaries specified by 
{($-\infty$, -8.47 dB), [-8.47 dB, -5.41 dB), [-5.41 dB, -3.28 dB), 
[-3.28 dB, -1.59 dB), [-1.59 dB, -0.08 dB), [-0.08 dB, 1.42 dB), 
[1.42 dB, 3.18 dB), [3.18 dB, $\infty$)}. With each bin we associate 
a channel state, giving the state space {-13 dB, -8.47 dB, 
-5.41 dB, -3.28 dB, -1.59 dB, -0.08 dB, 1.42 dB, 3.18 dB}. 
This discretization of the channel state space has been 
justified in [8]. We assume N = 20, i.e., a system 
with 20 users and thereby 20 UL connections. We assume 
that the number of users does not change during the course 
of simulations. In the symmetric case, all 20 users have the 
same parameters, while in the asymmetric case, users are divided 
into two groups (Group 1 and Group 2) of 10 users each 
with different parameters. We have simulated both symmetric 
and asymmetric cases. However, due to space constraints, we 
include the results for the asymmetric case only. 
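
The equal-probability bin boundaries listed above can be checked numerically. The sketch below is illustrative only. It assumes the boundaries were derived for a unit-mean exponential channel gain (the text does not state which mean was used), and the function name `equal_probability_boundaries_db` is hypothetical. Under that assumption, the k-th boundary is the k/8 quantile of the exponential distribution, $-\ln(1 - k/8)$, converted to dB.

```python
import math


def equal_probability_boundaries_db(num_bins=8, mean=1.0):
    """Interior boundaries (in dB) that split an exponential random
    variable with the given mean into num_bins equally likely bins."""
    boundaries = []
    for k in range(1, num_bins):
        # k/num_bins quantile of an exponential distribution
        quantile = -mean * math.log(1.0 - k / num_bins)
        boundaries.append(10.0 * math.log10(quantile))
    return boundaries


if __name__ == "__main__":
    for boundary in equal_probability_boundaries_db():
        print(f"{boundary:.2f} dB")
```

Under the unit-mean assumption, the computed values agree with the listed boundaries from -5.41 dB upward. The first boundary comes out near -8.74 dB rather than the listed -8.47 dB.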

B. Satisfaction of Delay Constraint 

In this sub-section, we demonstrate that the delay constraint 
of each user is satisfied for various constraints and average 
channel conditions. 

Scenario 1: In this scenario, we demonstrate that the al- 
gorithm satisfies the various user specified delay constraints. 
In each frame, arrivals are generated with a Poisson dis- 
tribution with mean 0.1 packets/msec. Packet lengths are 
Pareto distributed with the parameters detailed above. This 
results in an arrival rate of 0.386 Mbits/sec/user. We choose 
$\alpha^i = 0.4698$ (-3.28 dB) $\forall i$. In each slot, we generate 
$X^i$ using an exponential distribution with mean $\alpha^i$. We 
determine the channel state based on the bin that contains $X^i$, 
as explained above. 
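
The per-slot channel generation described above can be sketched as follows. This is a minimal illustration, not the simulator used for the reported results. The boundaries and states are copied from the setup, the pairing of bins with states in listed order is an assumption, and the names `quantize_channel` and `sample_channel_state` are hypothetical.

```python
import bisect
import math
import random

# Bin boundaries and per-bin states in dB, copied from the simulation setup.
BOUNDARIES_DB = [-8.47, -5.41, -3.28, -1.59, -0.08, 1.42, 3.18]
STATES_DB = [-13.0, -8.47, -5.41, -3.28, -1.59, -0.08, 1.42, 3.18]


def quantize_channel(x):
    """Map a linear channel gain x to the state (in dB) of its bin.

    Assumes bins and states correspond in the order they are listed."""
    if x <= 0.0:
        return STATES_DB[0]  # log10 is undefined; treat as the worst bin
    x_db = 10.0 * math.log10(x)
    # bisect_right keeps each bin closed on the left and open on the right
    return STATES_DB[bisect.bisect_right(BOUNDARIES_DB, x_db)]


def sample_channel_state(mean_gain, rng=random):
    """Draw an exponentially distributed gain and quantize it."""
    x = rng.expovariate(1.0 / mean_gain)
    return quantize_channel(x)


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_channel_state(0.4698, rng) for _ in range(5)])
```

Passing a seeded `random.Random` instance keeps runs reproducible, which is convenient when averaging over several simulation runs.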

We perform multiple experiments. The delay constraints of 
the users in Group 1 are fixed at 100 msec in each experiment, 
while the delay constraints of the users in Group 2 are fixed at 
25, 50, 75, 100, 125, 150, 175 msec in successive experiments. 
It can be observed from Figure 3 that the delay constraints 
are satisfied for all the constraint values. Moreover, from 
Figure 4, it can be observed that the power expended is a convex 
decreasing function of the delay constraint imposed by the 
user: a larger delay constraint implies that less power is 
required to satisfy it. 

Note that our algorithm does not use this knowledge of the channel and 
arrival process models. 

Fig. 4. Power expended with specified delay constraints 

Scenario 2: In this scenario, we demonstrate that the algo- 
rithm satisfies the user specified delay constraints for various 
channel conditions. We consider the asymmetric case where 
we hold the average channel state for users in Group 1 
constant across all the experiments, i.e., $\alpha^i = -3.28$ dB for 
$i \in \{1, \ldots, 10\}$. For the users in Group 2, i.e., for 
$i \in \{11, \ldots, 20\}$, the average channel state is fixed at 
$\alpha^i = -13$ dB, -8.47 dB, -5.41 dB, -3.28 dB, -1.59 dB, 
-0.08 dB, 1.42 dB in successive experiments. The average delay 
suffered by a user in each of the two groups, and the power each 
consumes, are plotted in Figures 5 and 6, respectively. From 
Figure 5, it can be observed that the scheme is able to satisfy 
the delay constraints above a certain average channel state. The 
maximum power with which the users can transmit in any slot 
determines the capacity of the system. If the maximum power 
is high, the scheme is able to satisfy the delay constraints even 
for poor channel states. Thus, the maximum power determines 
the average channel state above which the delay constraints 
are satisfied. From Figure 6, it can be observed that better 
channel conditions result in much less power being required 
to satisfy the delay constraints. 

C. Comparison with M-LWDF 

Here, we compare the power consumed by our algorithm 
with that consumed by the M-LWDF scheme [24]. The arrival rate 
of all the users is fixed at 0.2702 Mbits/sec (0.07 packets/msec) 
for all the experiments, while $\alpha^i = 0.4698$ (-3.28 dB) $\forall i$. 
We first determine the average delay experienced by the users 
with different maximum power constraints (1.5 to 4.5 Watts) 
under the M-LWDF scheme. We keep track of the power expended 
by each user under the M-LWDF scheme. The delays experienced 
under the M-LWDF scheme are then set as the delay constraints 
for the proposed scheme. Figure 7 demonstrates that the proposed 
scheme satisfies the delay constraints. The comparison for 
power consumed under the two schemes is plotted in Figure 8. 
It can be seen from this figure that the proposed scheme con- 
sumes much less average power than the M-LWDF scheme. 
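
For context, the M-LWDF rule is commonly stated as serving, in each slot, the user $j$ that maximizes $\gamma_j W_j(t) r_j(t)$, where $W_j(t)$ is the head-of-line delay, $r_j(t)$ is the rate the channel currently supports, and $\gamma_j = a_j / \bar{r}_j$ with $a_j = -\log(\delta_j)/T_j$ for a delay threshold $T_j$ and a violation probability $\delta_j$. The sketch below illustrates that commonly cited form only. This section does not detail how the M-LWDF baseline was configured for the experiments, so the function name, argument names, and numbers are all hypothetical.

```python
import math


def mlwdf_select(hol_delays, rates, avg_rates, delay_thresholds, violation_probs):
    """Return the index of the user with the largest M-LWDF metric
    gamma_j * W_j * r_j, where gamma_j = -log(delta_j) / (T_j * avg_rate_j).

    Ties are broken in favor of the lowest user index."""
    best_user, best_metric = None, -math.inf
    for j, (w, r, r_avg, t, delta) in enumerate(
        zip(hol_delays, rates, avg_rates, delay_thresholds, violation_probs)
    ):
        gamma = -math.log(delta) / (t * r_avg)
        metric = gamma * w * r
        if metric > best_metric:
            best_user, best_metric = j, metric
    return best_user


if __name__ == "__main__":
    # Two users with identical channels; user 1 has the tighter delay threshold.
    print(mlwdf_select([10.0, 10.0], [2.0, 2.0], [2.0, 2.0],
                       [100.0, 25.0], [0.05, 0.05]))
```

The rule weighs delay against instantaneous rate but has no explicit power term, which is one reason a power comparison against it is informative.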



D. Performance under Different Information Accuracies 

Here, we study the performance of our scheme under 
different accuracies of information. Specifically, we determine 
the power consumption when 2 bits, 3 bits and 4 bits are 
employed in order to convey the rate information from the 
users to the base station. The parameters for the users are the 
same as those in Scenario 1 of Section VI-B. The results are 
plotted in Figure 9. It can be seen that as the information 
accuracy is increased, the power consumption decreases. 
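
One way to picture the effect of information accuracy: with $b$ feedback bits, a user can signal at most $2^b$ distinct rate levels. The sketch below is an assumption made purely for illustration, since the text does not specify how rates are encoded. It uses a uniform quantizer that rounds the desired rate up to the next representable level. The name `quantize_rate` and the parameter `max_rate` are hypothetical.

```python
import math


def quantize_rate(rate, bits, max_rate):
    """Round a desired rate (in fragments) up to one of 2**bits
    uniformly spaced levels in [0, max_rate].

    Returns the level index that would be signaled and the rate it represents."""
    levels = 2 ** bits
    step = max_rate / (levels - 1)
    index = min(levels - 1, math.ceil(rate / step))
    return index, index * step


if __name__ == "__main__":
    for bits in (2, 3, 4):
        index, reported = quantize_rate(7.0, bits, max_rate=15.0)
        print(f"{bits} bits: index {index}, reported rate {reported:.2f}")
```

Fewer bits give a coarser grid, so the reported rate overshoots the desired one by a larger margin. Whether this is the mechanism behind the trend in Figure 9 is not stated in the text.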

Remark 4: It is apparent that there exists a tradeoff between 
the information accuracy and the resources (power, bandwidth) 
consumed in conveying the information. We are currently 
investigating this tradeoff. 

Fig. 9. Power consumed for various information accuracies (horizontal axis: delay constraint of a user in Group 2, in msec) 

VII. Conclusions 

In this paper, we have proposed a novel scheduling algorithm 
for minimizing the average power of each user subject to an 
individual delay constraint in a multi-user uplink system. 
The primary difficulty in numerically determining an optimal 
policy for this problem is the large state space. To address 
this difficulty, we have proposed a novel extension of the 
single-user optimal algorithm to the multi-user setting. In our 
approach, the users can be thought of as bidding their rates 
to the base station, which then schedules the user bidding 
the highest rate. We note that it is not in the interest of 
users to bid unnecessarily high rates, as that might result 
in higher power consumption. We prove analytically that 
the proposed algorithm ensures user queue stability if the 
learning scheme converges and vice versa, and if so, satisfies 
the delay constraints of the users. Another advantage of our 
approach is that it does not require explicit knowledge 
of the probability distributions of the channel state and arrival 
processes. The algorithm is computationally efficient and has 
low communication overhead. It thus provides a powerful 
framework for uplink scheduling. Interesting future directions 
are to explore a network situation and, on a different note, to 
provide a more complete theoretical analysis. 